Add Static FP8 KV Support #737
Conversation
Pull Request Overview
This PR adds FP8 KV cache quantization support to the AutoRound quantization library. This feature enables quantizing the key-value cache in attention mechanisms to FP8 format for improved memory efficiency.
- Adds a new `enable_fp8_kv` parameter to control FP8 KV cache quantization
- Implements FP8 KV cache infrastructure with calibration and quantization context
- Updates test coverage to verify FP8 KV cache serialization and functionality
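For illustration, a minimal usage sketch of the new flag, assuming it is exposed on the `AutoRound` entry point as the overview describes. The placeholder model name, output path, and the exact placement of `enable_fp8_kv` are assumptions; the flag's name and location are revisited later in this conversation.

```python
# Minimal sketch, not the PR's actual example: enable_fp8_kv is the new flag
# described above, passed here at construction time for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    enable_fp8_kv=True,  # new flag from this PR (placement discussed below)
)
autoround.quantize()
autoround.save_quantized("./opt-125m-w4-fp8kv")
```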
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
auto_round/autoround.py | Adds enable_fp8_kv parameter and context manager integration |
auto_round/experimental/fp8_kv_cache.py | Core FP8 KV cache implementation with quantization logic |
test/test_cpu/test_export.py | Test updates to verify FP8 KV cache export functionality |
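For readers unfamiliar with static FP8 KV caching, the core idea behind the quantization logic in `auto_round/experimental/fp8_kv_cache.py` is to calibrate a fixed per-tensor scale for keys and values and then cast them to `float8_e4m3fn`. The sketch below is illustrative only; the helper names, scale granularity, and calibration flow are assumptions and are not taken from this PR's code.

```python
# Illustrative sketch of static (per-tensor) FP8 KV quantization; the function
# names here are hypothetical and not copied from fp8_kv_cache.py.
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def calibrate_kv_scale(observed_absmax: torch.Tensor) -> torch.Tensor:
    """Derive a static scale from the max |k| or |v| seen during calibration."""
    return (observed_absmax / FP8_E4M3_MAX).clamp(min=1e-12)


def quantize_kv(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize a key/value tensor to FP8 with a fixed (static) scale."""
    return (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)


def dequantize_kv(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a higher-precision tensor for the attention computation."""
    return x_fp8.to(torch.float16) * scale


# Example: calibrate on one batch of keys, then quantize and dequantize.
k = torch.randn(2, 8, 128, 64, dtype=torch.float16)
k_scale = calibrate_kv_scale(k.abs().amax().float())
k_fp8 = quantize_kv(k.float(), k_scale)
k_ref = dequantize_kv(k_fp8, k_scale)
```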
Please note that the KV cache does not affect tuning, since we only use the forward pass during tuning. So should the first change be moving the arg from __init__ to save_quantized?
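In other words, the suggestion amounts to the difference between the two call styles below. This is a hypothetical sketch reusing the model and tokenizer from the earlier example; neither line is copied from this PR, and the final placement is decided later in the thread.

```python
# Hypothetical sketch of the two placements under discussion.
from auto_round import AutoRound

# Option A: flag passed at construction time (tuning is unaffected either way).
ar = AutoRound(model, tokenizer, bits=4, enable_fp8_kv=True)
ar.quantize()
ar.save_quantized("./opt-125m-w4-fp8kv")

# Option B: flag passed only at export time, treating it as a serialization concern.
ar = AutoRound(model, tokenizer, bits=4)
ar.quantize()
ar.save_quantized("./opt-125m-w4-fp8kv", enable_fp8_kv=True)
```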
Since the other non-tuning args, such as …
Due to the large number of arguments in __init__, only the arguments likely to be used are explicitly listed, while the others are hidden in **kwargs. I assume the static KV cache is not frequently used, so we can keep it in **kwargs for now.
Agreed, updated.
The code could be improved by providing another arg, like static_kv_dtype, to replace the existing enable_static_kv and support more dtypes: fp4, int4/int8, etc.
Since only FP8 KV is supported on the vLLM end, I'd prefer to keep it as is in this PR. I'm open to adding support for more data types if needed.
OK, then how about changing the arg now to make the API more backward compatible in case we need to support other dtypes in the future? Only bf16/fp16/fp8 would be supported for now; for 16-bit dtypes, nothing needs to be done in the code.
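A rough sketch of what such an argument could look like. The helper and the handling of each value are assumptions based on this discussion, not the merged code; only the bf16/fp16/fp8 set mentioned above is handled, and 16-bit dtypes are a no-op.

```python
# Hypothetical validation/dispatch for a static_kv_dtype argument.
from typing import Optional

_SUPPORTED_STATIC_KV_DTYPES = {"bf16", "fp16", "fp8"}


def resolve_static_kv_dtype(static_kv_dtype: Optional[str]) -> Optional[str]:
    """Normalize the requested static KV-cache dtype.

    Returns None when no KV-cache quantization is needed (default, bf16, fp16)
    and the dtype string when the KV cache should be quantized (fp8 for now).
    """
    if static_kv_dtype is None:
        return None
    dtype = static_kv_dtype.lower()
    if dtype not in _SUPPORTED_STATIC_KV_DTYPES:
        raise ValueError(
            f"Unsupported static_kv_dtype {static_kv_dtype!r}; "
            f"expected one of {sorted(_SUPPORTED_STATIC_KV_DTYPES)}"
        )
    # 16-bit dtypes require no changes in the quantization flow.
    if dtype in ("bf16", "fp16"):
        return None
    return dtype  # "fp8": enable the FP8 KV-cache path
```

Other dtypes (fp4, int8, ...) could later be added to the supported set and dispatched to their own paths without changing the public API.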
Sure, updated.